[Enhancement] Implement metrics reporting for MemTrackerManager #68170
arin-mirza wants to merge 9 commits into StarRocks:main
Conversation
@cursor review

1 similar comment

@cursor review
@alvin-celerdata @kevincai I closed the previous PR where you were reviewers, can I get a review for this one, please? :)

@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f7a13d9f6d
@cursor review
@alvin-celerdata Can this PR be merged? Is there anything else that needs to be done?
Pull request overview
This PR implements metrics reporting for MemTrackerManager to expose memory pool usage statistics. Previously, there were no backend metrics for memory pools. The implementation adds four new metrics: mem_pool_mem_limit_bytes, mem_pool_mem_usage_bytes, mem_pool_mem_usage_ratio, and mem_pool_workgroup_count. The implementation follows the established locking and metrics registration patterns from WorkGroupManager to avoid deadlocks with the metrics collector.
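For context, here is a minimal, self-contained sketch of that register-a-hook-then-update-under-the-manager's-own-lock pattern. Every name in it (MetricRegistryLike, Gauge, register_hook, the member fields) is an assumption made for illustration, not the actual StarRocks API:

```cpp
#include <functional>
#include <map>
#include <memory>
#include <mutex>
#include <shared_mutex>
#include <string>
#include <utility>
#include <vector>

struct Gauge {
    void set_value(long v) { value = v; }
    long value = 0;
};

// Stand-in for the metrics collector: before serving /metrics it invokes every
// registered hook, then serializes the gauge values.
struct MetricRegistryLike {
    void register_hook(std::string name, std::function<void()> hook) {
        hooks.emplace_back(std::move(name), std::move(hook));
    }
    std::vector<std::pair<std::string, std::function<void()>>> hooks;
};

class MemPoolMetricsSketch {
public:
    void register_metrics(MetricRegistryLike* registry) {
        // The hook is a closure that re-enters this manager and takes the write
        // lock itself, so the manager never calls into the registry while holding
        // its own lock -- the deadlock-avoidance pattern described above.
        registry->register_hook("mem_pool_metrics", [this] { update_metrics(); });
    }

private:
    void update_metrics() {
        std::unique_lock<std::shared_mutex> guard(_mutex);
        for (auto& entry : _usage_gauges) {
            // Real code would read the current usage from the pool's MemTracker.
            entry.second->set_value(0);
        }
    }

    std::shared_mutex _mutex;
    std::map<std::string, std::unique_ptr<Gauge>> _usage_gauges;
};
```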
Changes:
- Added metrics infrastructure to MemTrackerManager with thread-safe registration and update mechanisms
- Updated list_mem_trackers() to exclude the default memory pool, improving consistency (see the sketch after this list)
- Added documentation in English, Chinese, and Japanese for the new metrics
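A rough sketch of that list_mem_trackers() change, with the member layout and the DEFAULT_MEM_POOL value assumed purely for illustration:

```cpp
#include <map>
#include <shared_mutex>
#include <string>
#include <vector>

class MemTrackerListSketch {
public:
    // Hypothetical value; only the fact that a default pool exists comes from the PR.
    static constexpr const char* DEFAULT_MEM_POOL = "default_mem_pool";

    std::vector<std::string> list_mem_trackers() const {
        std::shared_lock<std::shared_mutex> guard(_mutex);
        std::vector<std::string> names;
        for (const auto& entry : _pools) {
            if (entry.first == DEFAULT_MEM_POOL) continue;  // default pool is no longer listed
            names.push_back(entry.first);
        }
        return names;
    }

private:
    mutable std::shared_mutex _mutex;
    std::map<std::string, long> _pools;  // pool name -> configured limit (simplified)
};
```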
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| be/src/exec/workgroup/mem_tracker_manager.h | Added MemTrackerMetrics struct, metrics-related private methods, and mutex for thread synchronization |
| be/src/exec/workgroup/mem_tracker_manager.cpp | Implemented metrics registration, update logic, and modified list_mem_trackers() to exclude default pool |
| be/test/exec/workgroup/work_group_manager_test.cpp | Updated test expectations to reflect that default memory pool is no longer included in the list, removed unnecessary sleep |
| docs/en/administration/management/monitoring/metrics.md | Added English documentation for the four new metrics |
| docs/zh/administration/management/monitoring/metrics.md | Added Chinese documentation for the four new metrics |
| docs/ja/administration/management/monitoring/metrics.md | Added Japanese documentation for the four new metrics |
- Unit: -
- Description: Ratio of internal table scan thread time slices used by each resource group to the total used by all resource groups. This is an average value over the time interval between two metric retrievals.

### mem_pool_mem_limit_bytes
The actual metric name will be prefixed with starrocks_be_, so to the end user it is actually starrocks_be_mem_pool_mem_limit_bytes. Use the final name, since this doc is for end users.
I updated the documentation to use the starrocks_be_ prefix.
I think the documentation is inconsistent in this regard.
Most resource-group-related metrics are duplicated: a description exists both with and without the starrocks_be prefix. I also checked Datadog to see if they are actually reported twice, but no, they are always reported under a name with the starrocks_be prefix.
In metrics.md:
| Without prefix | With prefix | In Datadog |
|---|---|---|
| resource_group_mem_limit_bytes | starrocks_be_resource_group_mem_limit_bytes | with prefix |
| resource_group_mem_inuse_bytes | - | with prefix |
| resource_group_cpu_limit_ratio | starrocks_be_resource_group_cpu_limit_ratio | with prefix |
| resource_group_cpu_use_ratio | starrocks_be_resource_group_cpu_use_ratio | with prefix |
| resource_group_scan_use_ratio | - | with prefix |
| resource_group_inuse_cpu_cores | - | with prefix |
| resource_group_connector_scan_use_ratio | - | with prefix |
| - | starrocks_be_resource_group_mem_allocated_bytes | does not exist |
I believe all metrics above should only have one entry with the starrocks_be prefix.
Moreover, I could not find starrocks_be_resource_group_mem_allocated_bytes in Datadog. This metric was renamed to resource_group_mem_inuse_bytes in 6204611 on 2022-06-29. The entry should be removed from the user documentation if safe to do so.
Yeah, I think the doc is messed up somehow; the actual metric name produced via /metrics should be the canonical one in this doc.
Just keep the newly added ones prefixed with starrocks_be_; the other inconsistent ones will be fixed in a dedicated PR.
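For illustration only (the label and sample value below are hypothetical), a prefixed entry on the BE /metrics endpoint would look roughly like this, and that prefixed name is what the doc should show end users:

```
# TYPE starrocks_be_mem_pool_mem_limit_bytes gauge
starrocks_be_mem_pool_mem_limit_bytes{name="example_pool"} 34359738368
```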
metrics->workgroup_count->set_value(child_count);
    }
} else {
// Metrics entries for deleted shared_mem_trackers are never deleted, but simply set to 0.
In a long-running system with frequent workgroup creation and deletion, will these garbage metrics accumulate and cause memory growth and long, useless serialization on the /metrics interface?
Not really.
Workgroups belong to DEFAULT_MEM_POOL by default; these are not tracked in MemTrackerManager, and it does not report memory pool statistics for such workgroups. Frequent creation and deletion of workgroups under a memory pool is a special case.
In the general case, we are already paying the price of not deleting the metrics for workgroups. Compare my implementation with the _update_metrics_unlocked() method in work_group.cpp (be/src/exec/workgroup/work_group.cpp, lines 475 to 490 in cd9ba32).
Your concern about garbage metrics accumulating applies there even more. I believe the reason it was implemented this way is that deleting metrics entries makes handling race conditions more complicated and error-prone.
I can change my implementation so that unused metrics are deleted instead of being set to 0. However, I am not sure it is worth the extra effort, since work_group.cpp does not delete its own metrics anyway.
Let me know which one you prefer.
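To make the trade-off concrete, here is a minimal sketch of the "keep the entry, zero it out" behavior being discussed; the struct and function names are made up for illustration and are not the actual StarRocks code:

```cpp
#include <map>
#include <memory>
#include <set>
#include <string>

struct PoolMetricsSketch {
    long limit_bytes = 0;
    long usage_bytes = 0;
    long workgroup_count = 0;
};

// Pools that disappeared keep their metric entry but report zeros; live pools
// would be refreshed from their MemTracker (omitted here). Erasing entries
// instead would shrink /metrics output but complicates races with the collector.
void update_pool_metrics(const std::set<std::string>& live_pools,
                         std::map<std::string, std::unique_ptr<PoolMetricsSketch>>& metrics_by_pool) {
    for (auto& entry : metrics_by_pool) {
        if (live_pools.count(entry.first) == 0) {
            entry.second->limit_bytes = 0;
            entry.second->usage_bytes = 0;
            entry.second->workgroup_count = 0;
        }
    }
}
```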
OK, will see if other reviewers have thoughts. I am OK to keep it as is for now.
🌎 Translation Required? ✅ All translation files are up to date.
Force-pushed from 511ea62 to 3671e20.
Force-pushed from 3671e20 to 3be1477.
I rebased onto the latest main.
[Java-Extensions Incremental Coverage Report] ✅ pass: 0 / 0 (0%)
[FE Incremental Coverage Report] ✅ pass: 0 / 0 (0%)
[BE Incremental Coverage Report] ❌ fail: 41 / 68 (60.29%) (file detail)
Why I'm doing:
There are currently no backend metrics reported for memory pools.
I previously tried to add them by extending the workgroup metrics, but this turned out to be an incorrect approach.
What I'm doing:
This PR implements metric reporting for MemTrackerManager and adds the following new metrics: mem_pool_mem_limit_bytes, mem_pool_mem_usage_bytes, mem_pool_mem_usage_ratio, and mem_pool_workgroup_count.
The implementation follows the same locking structure that is present in WorkGroupManager: a mutex was added to MemTrackerManager because the update_metrics callback hook passed to MetricRegistry needs to be a closure which captures a write lock, mirroring the pattern in WorkGroupManager.
Minor: Changed the list_mem_trackers() method to not return the default memory pool name.
Tests and Docs
What type of PR is this:
Does this PR entail a change in behavior?
If yes, please specify the type of change:
Checklist:
Bugfix cherry-pick branch check: